We consider three important problems in the analysis of categorical questionnaire
data: first, the assessment of question worth and variable selection; second, the
assessment of question validity using a pretest; and third, discrete discriminant
analysis when the data are non-ordinal. The unifying approach used throughout is the
concept of information theoretic distance measures. Simulations and applications to
real data are presented.


1970  AMS  Subject  classification:  Primary  62H30,  62L99,  62F99. 
§1. Introduction

The analysis of categorical questionnaires poses many interesting problems, of
which we shall consider three: the assessment of which questions are worthwhile and
which questions should be excluded (variable selection), the assessment of question
validity and overall questionnaire validity, and the problem of discriminant analysis
using categorical questionnaire data. These three problems are considered here as
variants of a single problem which we shall attack using information theoretic
techniques.

The use of information theoretic techniques is especially appealing in the analysis
of questionnaire data since the entire purpose of such data is to answer some
specific queries, and the worth of each question should be determined according to how
much information is supplied by the question towards answering these queries. To make
this mathematically rigorous, suppose we wish to decide whether a respondent belongs
in group 1 or group 2 with respective generalised densities $f_1$ and $f_2$ with respect
to some measure $\lambda$. If the prior probabilities of group $i$ membership are $\pi_i$,
$i = 1, 2$, then the log odds ratio in favor of group 1 membership is $\ln(\pi_1/\pi_2)$.
If an observation $x$ is made on the respondent, Bayes' Theorem may be used to determine
the new posterior log odds ratio in favor of group 1 membership,
$\ln(\pi_1 f_1(x)/(\pi_2 f_2(x)))$. The difference between the posterior and prior log
odds ratio is taken as a measure of the amount of information supplied by the
observation $x$ for discrimination in favor of group 1 membership. One easily works out
that this difference is $\ln(f_1(x)/f_2(x))$, and this quantity is called the information
gain from the observation in favor of group 1 membership, or simply the information
gain (cf. Kullback (1959)). The expected information gain is obtained by randomizing
$x$ according to the density $f_1$, obtaining the directed information measure
$$I(f_1|f_2) = \int f_1(x) \ln\frac{f_1(x)}{f_2(x)} \, \lambda(dx).$$

The  symmetric  measure  of  information  between  the  two  groups  is  called  the  divergence 
between  the  groups  and  is  denoted  by 

$$J(f_1,f_2) = I(f_1|f_2) + I(f_2|f_1) = \int (f_1(x) - f_2(x)) \ln\frac{f_1(x)}{f_2(x)} \, \lambda(dx).$$

For the categorical questionnaires we shall be considering, we take $\lambda$ as counting
measure and the integrals become summations: $J(f_1,f_2) = \sum_i (p_i - q_i)\ln(p_i/q_i)$,
where $f_1(x_i) = P_1[X = x_i] = p_i$ and $f_2(x_i) = P_2[X = x_i] = q_i$.
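As an illustration of the discrete formulas, the following minimal Python sketch computes the directed information and the divergence for two answer distributions. The function names and the example probabilities are hypothetical, and the divergence routine assumes all probabilities are strictly positive.

```python
import math

def directed_information(p, q):
    """I(p|q) = sum_i p_i * ln(p_i / q_i); terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def divergence(p, q):
    """J(p, q) = sum_i (p_i - q_i) * ln(p_i / q_i); assumes all p_i, q_i > 0."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical answer probabilities for one three-answer question.
p = [0.5, 0.3, 0.2]   # group 1
q = [0.2, 0.3, 0.5]   # group 2
j = divergence(p, q)  # equals I(p|q) + I(q|p)
```

The divergence is symmetric in its arguments and vanishes when the two groups answer identically, which is what makes it usable as a distance-like measure between groups.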

§2. A measure of question validity

We  assume  that  the  purpose  of  the  questionnaire  is  to  obtain  a summary  index 
of  how  much  of  a certain  attribute  is  possessed  by  the  respondent.  For  example  a 
psychiatric  screening  exam  might  measure  how  much  "mental  stress"  is  exhibited  by  a 
respondent,  while  in  an  industrial  context,  a quality  control  checklist  might  measure 
how much "propensity to fail" is exhibited by a certain machine. Employment screening
exams which hope to measure a candidate's potential job success are another
example.

A common method of assessing the reliability and/or validity of a particular question in
questionnaires such as those outlined above is to compare a respondent's overall questionnaire
score with the score obtained on that particular question. The method we propose here
is in this vein. We divide the respondents into quartiles $Q_1$, $Q_2$, $Q_3$ and $Q_4$, based
upon their overall questionnaire scores excluding the question we wish to assess, and
we measure the worth of that particular question by the amount of information it
possesses for discriminating between these high and low scorers.


Let $p_1, \ldots, p_k$ denote the proportions of high scorers and $q_1, \ldots, q_k$ denote
the proportions of low scorers (the extreme quartiles $Q_1$ and $Q_4$) responding to the $k$
answers to the question under consideration, and suppose there are $n$ respondents
in each of the reference groups $Q_1$ and $Q_4$. A measure of the amount of information
in the question for discriminating between $Q_1$ and $Q_4$ (and hence a measure of the
worth of the question) is given by taking a linear function of the estimated
information theoretic divergence between $Q_1$ and $Q_4$. We define the D-value of the
question to be

$$D = \frac{n}{2} \sum_{i=1}^{k} (p_i - q_i) \ln\frac{p_i}{q_i}.$$

We would discard a question if $D$ is too close to zero, indicating there is not
sufficient information furnished by the question to discriminate between the high
and low questionnaire scorers. Kullback (1959) shows that under a null hypothesis
of $(p_1, \ldots, p_k) = (q_1, \ldots, q_k)$ (corresponding to the question having no
discriminatory value), the asymptotic distribution of $D$ is $\chi^2(k-1)$. Thus, our
procedure is to retain a question only if $D > \chi^2_{1-\alpha}(k-1)$, where
$\chi^2_{1-\alpha}(k-1)$ is the $(1-\alpha)$-th quantile of the $\chi^2(k-1)$ distribution.
The probability of erroneously including a nondiscriminating question by using this
procedure converges to $\alpha$ as the sample size increases. Another advantage of this
procedure is that it should aid in establishing questionnaire validity, since using this
procedure includes only questions of proven discriminatory worth in the final
questionnaire. For questionnaires such as employment screening questionnaires, in which
for legal reasons each question's inclusion must be justified, this method should be
useful.
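The retention rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the counts are hypothetical pretest data, the critical value is read from a standard chi-square table rather than computed, and every answer is assumed to have been chosen at least once in each group (a zero count would make the logarithm undefined).

```python
import math

def d_value(high_counts, low_counts):
    """D = (n/2) * sum_i (p_i - q_i) * ln(p_i / q_i) for two groups of equal size n."""
    n = sum(high_counts)
    if n != sum(low_counts):
        raise ValueError("both reference groups must contain n respondents")
    total = 0.0
    for a, b in zip(high_counts, low_counts):
        p, q = a / n, b / n              # empirical answer proportions
        total += (p - q) * math.log(p / q)
    return n / 2 * total

# Hypothetical counts for a question with k = 3 answers and n = 50 per group.
high = [30, 15, 5]
low = [10, 15, 25]
CRITICAL_95_DF2 = 5.991   # 0.95 quantile of chi-square(2), taken from a table
d = d_value(high, low)
retain = d > CRITICAL_95_DF2
```

With these counts the D-value is roughly 27, well above the tabulated critical value, so the question would be retained.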

§3. Variable selection: which questions should be included

We shall again assume that the questionnaire is categorical, and we shall evaluate
a question or sequence of questions by how much information is contained for
discriminating between two pre-given groups. (These may, of course, be $Q_1$ and $Q_4$
as in the previous section; however, any two groups which we wish the questionnaire
to distinguish will also serve our purposes.) We assume that we are given $n_1$
respondents from group one and $n_2$ respondents from group two, and we wish to
develop a sequential procedure for determining which questions to include in the
questionnaire analogous to the stepwise selection of variables in regression analysis.

For a particular question X, let $J(X)$ denote the divergence between the
probability distributions of group 1 and group 2 over the question, i.e.

$$J(X) = \sum_{x=1}^{k_X} (\hat p_x - \hat q_x) \ln\frac{\hat p_x}{\hat q_x}.$$

It tells us the amount of information in question X for discriminating between the
groups 1 and 2 with empirical response probabilities $\hat p$ and $\hat q$ respectively over
the $k_X$ answers to question X.

Our sequential procedure begins by choosing for first inclusion the most
informative question X for discriminating. In this first step our procedure is
similar in philosophy to that described by Levine (1974), Brockett, Haaland and
Levine (1977b) and by Goldstein and Dillon (1977) (see also Goldstein and Dillon
1978) for selecting binary variables for inclusion in a multiway contingency table
discrimination framework. In our case, however, we cannot assume that two categorical
questions have the same number of permissible categorical responses; e.g., the questions
"Sex" and "Income level" may have markedly different numbers of response categories.
This prohibits us from using the Goldstein-Dillon procedure. We shall use the
quantity $D(X) = \frac{n_1 n_2}{n_1 + n_2} J(X)$ as a measure of information in question X for
discrimination (cf. Kullback (1959), Gokhale and Kullback (1978)). Asymptotically
$D(X)$ has a $\chi^2(k_X - 1)$ distribution, and this is why direct comparison of the
calculated $D(X)$ values is impossible. A question with $k_X = 11$ answers is expected
to have a $D(X)$ value of 10 while a question with $k_X = 2$ would be expected to
have a $D(X)$ value of 1 under the hypothesis that the question is not
discriminating. This does not directly imply, however, that the question with 11 answers
is more desirable than the question with 2 answers.


Since $D(X)$ has a $\chi^2(k_X - 1)$ distribution under the null hypothesis that the
two groups respond the same to the question, and has a non-central $\chi^2(k_X - 1)$
distribution with a non-centrality parameter equal to the discriminatory power of the
question in the case where the alternative hypothesis holds and the question actually
discriminates, we shall use instead the p-value of the $D(X)$ statistic as a measure
of discriminatory power of question X. If $p_X = P[\chi^2(k_X - 1) > d]$ where $d$ is the
observed value for $D(X)$, then the smaller $p_X$, the more informative is question
X. Although the values $D(X)$ for various questions X may not be directly
comparable in general (as they would be for example if $k_X$ was always the same), the
p-values $p_X$ are always comparable and easily calculated from readily available $\chi^2$
tables. (Alternately, if no tables are available, the normalized quantities

$$\frac{D(X) - (k_X - 1)}{\sqrt{2(k_X - 1)}}$$

would be quickly comparable variables, approximately normal zero-one for $k_X$ large.)
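To make the comparison concrete, here is a small Python sketch that converts D(X) values into p-values and into the normalized quantities. The helper `chi2_sf` is a hypothetical stand-in for a table lookup; it uses the standard closed-form finite sums for the chi-square tail with integer degrees of freedom. The two D values are invented for illustration.

```python
import math

def chi2_sf(x, df):
    """P[chi-square(df) > x] for integer df >= 1, via closed-form finite sums."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2) / j
            total += term
        return math.exp(-x / 2) * total
    # odd df: erfc(sqrt(x/2)) + 2*phi(chi) * sum_r chi^(2r-1) / (1*3*...*(2r-1))
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    phi = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    return math.erfc(chi / math.sqrt(2)) + 2 * phi * total

def normalized(d, k):
    """(D - (k - 1)) / sqrt(2 (k - 1)), roughly standard normal for large k."""
    return (d - (k - 1)) / math.sqrt(2 * (k - 1))

# Hypothetical D(X) values for an 11-answer and a 2-answer question.
p_eleven = chi2_sf(12.0, 10)   # D = 12 with k = 11
p_two = chi2_sf(6.0, 1)        # D = 6 with k = 2
```

Here the two-answer question has the smaller D value but also the smaller p-value, so it would be ranked as the more informative of the two.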


Using the p-values, which are distributed uniformly over [0,1] under the null
hypothesis of no discriminatory power, we select as the first question that question
X with minimum $p_X$ value, provided this $p_X$ value is significantly small. We can
assess significance for $\min_{1 \le X \le m} p_X$ by using the distribution function
$F(t) = 1 - (1-t)^m$ for $0 \le t \le 1$ as the c.d.f. for $\min_{1 \le X \le m} p_X$. Thus the
best question has significant discriminatory power at level of significance $\alpha$ if

$$\min_{1 \le X \le m} p_X < 1 - (1-\alpha)^{1/m}.$$

(The Goldstein-Dillon procedure does not employ the actual
distribution for their selection statistic, and hence will not lead to a fixed
type 1 error.) Having chosen the first question for inclusion according to this
procedure, we select the second question for inclusion as that question which
yields the maximum additional information to the already selected first question.
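The significance check for the smallest p-value can be sketched as follows. The function names and the p-values are hypothetical; the distribution function used is that of the minimum of m independent uniform variables.

```python
def min_p_cdf(t, m):
    """F(t) = 1 - (1 - t)^m, the c.d.f. of the smallest of m uniform p-values."""
    return 1 - (1 - t) ** m

def min_p_threshold(alpha, m):
    """Critical value t solving F(t) = alpha, i.e. 1 - (1 - alpha)^(1/m)."""
    return 1 - (1 - alpha) ** (1 / m)

# Hypothetical p-values p_X for m = 4 candidate questions.
p_values = [0.20, 0.003, 0.04, 0.51]
best = min(range(len(p_values)), key=p_values.__getitem__)
threshold = min_p_threshold(0.05, len(p_values))
significant = p_values[best] < threshold
```

For m = 4 and a 5% level the threshold is about 0.0127, so the third question (p = 0.04) would not have qualified on its own even though its unadjusted p-value is below 0.05.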

For notational convenience, relabel the questions so that the first question
selected is called question 1. We look at all question pairs (1,Y), $2 \le Y \le m$, and
consider the joint probabilities $p_{xy}$ and $q_{xy}$ for the two groups over the
possible answer pairs $(x,y)$ on questions (1,Y).

The quantity

$$D(1,Y) = \frac{n_1 n_2}{n_1+n_2} \sum_{x=1}^{k_1} \sum_{y=1}^{k_Y} (p_{xy} - q_{xy}) \ln\frac{p_{xy}}{q_{xy}}$$

is a measure of the joint discriminatory power of the question pair (1,Y), and hence
$D(1,Y) - D(1)$ is a measure of the increase in discriminatory information obtained by
adding question Y. Note that

$$D(1,Y) - D(1) = \frac{n_1 n_2}{n_1+n_2} \sum_x \sum_y (p_{xy} - q_{xy}) \ln\frac{p_{Y|x}(y)}{q_{Y|x}(y)}
= \frac{n_1 n_2}{n_1+n_2} \sum_x p_x I(p_{Y|x}|q_{Y|x}) + \frac{n_1 n_2}{n_1+n_2} \sum_x q_x I(q_{Y|x}|p_{Y|x}),$$

where $p_{Y|x}$ and $q_{Y|x}$ are the conditional probability distributions of groups 1 and 2
respectively over the answers to question Y given that question 1 was answered $x$,
i.e., $p_{Y|x}(y) = p_{xy}/p_x$. This equality implies $D(1,Y) - D(1) \ge 0$ with equality only
if Y contains no additional information given the answer to question 1 (i.e. the
addition of Y can only improve things). This equation also shows that the distribution
of $D(1,Y) - D(1)$ is a weighted sum of (non-independent) $\chi^2$ variables, the weights
reflecting the probability of a particular response $x$ to question 1, and the $\chi^2$
variable measuring the information expected to be added by question Y given that
particular response $x$ to question 1. A stepwise regression analogue would consider
$(D(1,Y) - D(1))/D(1)$ as a measure of increased discriminatory power obtained by the
addition of question Y. The distributional properties of this ratio have not been
explored in full; however, when the question responses are independent variates, one
has $D(1,Y) = D(1) + D(Y)$, so the above ratio has (asymptotically) an F distribution
with parameters $(k_Y - 1, k_1 - 1)$ (see Brockett, Haaland and Levine (1977a,b)). When
the responses are not independent, this procedure is conservative in that the true
ratio is stochastically dominated by the given F distribution.
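The decomposition of the information gained by adding question Y can be checked numerically. The sketch below uses hypothetical joint answer probabilities for a pair of two-answer questions and omits the common factor $n_1 n_2/(n_1+n_2)$, which multiplies both sides equally.

```python
import math

def divergence(p, q):
    """J(p, q) = sum (p_i - q_i) * ln(p_i / q_i)."""
    return sum((a - b) * math.log(a / b) for a, b in zip(p, q))

def directed(p, q):
    """I(p|q) = sum p_i * ln(p_i / q_i)."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))

# Hypothetical joint probabilities p[x][y], q[x][y] over answers to (1, Y).
p = [[0.30, 0.10], [0.20, 0.40]]   # group 1
q = [[0.10, 0.30], [0.35, 0.25]]   # group 2

flat = lambda table: [v for row in table for v in row]
px = [sum(row) for row in p]       # marginal of question 1, group 1
qx = [sum(row) for row in q]       # marginal of question 1, group 2

# Left-hand side: joint divergence minus divergence of question 1 alone.
gain = divergence(flat(p), flat(q)) - divergence(px, qx)

# Right-hand side: weighted conditional directed informations.
conditional = 0.0
for x in range(len(p)):
    p_cond = [v / px[x] for v in p[x]]
    q_cond = [v / qx[x] for v in q[x]]
    conditional += px[x] * directed(p_cond, q_cond) + qx[x] * directed(q_cond, p_cond)
```

The two quantities agree up to rounding, and both are non-negative because each conditional directed information is non-negative.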

In this paper we shall also present a different approach based upon information
theoretic analysis similar to that used in contingency table analysis (cf. Gokhale
and Kullback (1978) and Kullback (1959)). This method is more convenient to use than
Goldstein and Dillon's technique since one utilizes the entire set of answers to the
previous question for determining the usefulness of a proposed new question rather
than having to condition individually upon each possible response. Goldstein and
Dillon's technique would result in different respondents obtaining different sequences
of questions and perhaps different questions. Since the order of presentation of
questions has been shown to make a statistically significant difference in
questionnaire score (cf. Payne (1951), Oppenheim (1966), and for an up-to-date bibliography
and study, see Kalton, Collins and Brook (1978)), we desire a unified approach to
variable selection not dependent upon the particular answers given to previous
questions. The Goldstein-Dillon method is useful for expensive binary medical tests.

Let $x_{ijy}$ denote the number of respondents in group $i$ ($i = 1, 2$) who answer $j$
to question 1 and answer $y$ to question Y, and $p(i,j,y)$ represent the
corresponding proportion of respondents in group $i$ with these answers. A test of the
hypothesis that the inclusion of question Y yields no additional discriminatory power
can be obtained by testing the hypothesis that the conditional distribution of Y
is independent of the group classification given the answer to question 1, i.e.

$$H_{1,Y}: \; p(i,y|j) = p(i|j)\,p(y|j).$$
The worth of question Y is assessed by the p-value for rejecting $H_{1,Y}$. Using the
directed divergence distance measure to test $H_{1,Y}$ yields the statistic

$$I(Y|1) = 2 \sum_{i=1}^{2} \sum_{j=1}^{k_1} \sum_{y=1}^{k_Y} x_{ijy} \ln\frac{x_{ijy}\, x_{\cdot j \cdot}}{x_{ij\cdot}\, x_{\cdot jy}},$$

where the dot replacing a subscript is the usual convention for "sum over that
variable", i.e. $x_{ij\cdot} = \sum_{y=1}^{k_Y} x_{ijy}$ etc. The distribution of $I(Y|1)$ under
$H_{1,Y}$ is asymptotically $\chi^2$ with $k_1(k_Y - 1)$ degrees of freedom (cf. Kullback 1959,
or Gokhale and Kullback 1978). Hence if $p_Y = P[\chi^2(k_1 k_Y - k_1) > i(Y|1)]$, where
$i(Y|1)$ is the observed value for $I(Y|1)$, we select the second question to minimize
$p_Y$, $2 \le Y \le m$. The exact level of significance can be found from the distribution
function $F(t) = 1 - (1-t)^{m-1}$, $0 \le t \le 1$.
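A minimal Python sketch of the conditional independence statistic follows. The count array layout `x[i][j][y]` and the example counts are hypothetical; cells with zero count are taken to contribute nothing to the sum, and every marginal involved is assumed to be positive.

```python
import math

def conditional_information(x):
    """Return (I(Y|1), degrees of freedom) for counts x[i][j][y].

    i indexes the group, j the answer to question 1, y the answer to question Y.
    """
    groups, k1, ky = len(x), len(x[0]), len(x[0][0])
    stat = 0.0
    for j in range(k1):
        x_dot_j_dot = sum(x[i][j][y] for i in range(groups) for y in range(ky))
        for i in range(groups):
            x_ij_dot = sum(x[i][j])
            for y in range(ky):
                x_dot_jy = sum(x[g][j][y] for g in range(groups))
                cell = x[i][j][y]
                if cell > 0:   # empty cells contribute zero
                    stat += cell * math.log(cell * x_dot_j_dot / (x_ij_dot * x_dot_jy))
    return 2 * stat, k1 * (ky - 1)

# Hypothetical counts: Y depends on the group when question 1 is answered j = 0.
counts = [
    [[20, 10], [30, 30]],   # group 1
    [[10, 20], [10, 10]],   # group 2
]
statistic, df = conditional_information(counts)
```

For these counts the statistic is about 6.8 on 2 degrees of freedom, which exceeds the tabulated 5% critical value of 5.991 for a single test.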

If the p-value for the second question is significant, we proceed on to select
the third question by the same procedure. We test if the group classification and
the answer to question Z, $3 \le Z \le m$, are conditionally independent given the response
to questions 1 and 2. The information statistic is

$$2 \sum_{i=1}^{2} \sum_{j=1}^{k_1} \sum_{k=1}^{k_2} \sum_{l=1}^{k_Z} x_{ijkl} \ln\frac{x_{ijkl}\, x_{\cdot jk \cdot}}{x_{ijk\cdot}\, x_{\cdot jkl}},$$

which is $\chi^2$ with $k_1 k_2 (k_Z - 1)$ degrees of freedom. Again, the p-values for each
question $3 \le Z \le m$ are compared to determine if the minimum p-value question is
significant. If it is, we include it as question 3 and proceed until we obtain a
non-significant result. When we finally find a non-significant result, we quit
adding questions and consider the questionnaire complete. This procedure can reduce the
size of the overall questionnaire.


One problem with utilizing contingency table methods of analysis is the rapid
proliferation of cells, resulting in possible empty cells. Empty cells here won't
bother us; however, zero marginals will. The problem of zero marginals, and the
resulting loss of degrees of freedom, is discussed in Gokhale and Kullback (1978).

Another approach is to break the questionnaire into subsets of 3 or 4 questions
each, each group of questions being treated separately. For example, a group of
questions concerning economic status might be treated separately from a group
concerning health, which in turn is treated separately from a group concerning education
level. Each group is analyzed to obtain the questions in that particular group which
should be included. The best subset of each group is then combined to form the
overall questionnaire. Still another approach to the sparse cells problem is the "nearest
neighbor" approach of Hills (1966). One groups together respondents whose previous
answers differ in only one place on one question.

§4.  Discrimination  using  information  gain 

The problem of discriminant analysis using categorical data has attracted much
attention in the recent literature (cf. Lachenbruch 1975, Goldstein and Dillon 1978
and the bibliographies contained therein). If the number of variables is quite small,
a contingency table approach is possible. For moderate numbers of variables, a log
linear model may be fitted assuming certain higher order interaction terms vanish
(cf. Gokhale and Kullback 1978, and for computational techniques, see Brockett,
Charnes and Cooper 1979). If the questions all have ranked responses, or perhaps
binary responses, then Fisher's linear discriminant function (LDF) (Fisher 1936) has proven
to be quite effective for classifying respondents into their correct group (cf.
Lachenbruch 1975). Fisher's LDF does not perform well, however, when the answers are
not of a ranked character. Moore (1973) refers to this as reversal in the likelihood
ratio.

Discriminant analysis for discrete data involves two processes: first, one must
score the categories, and second, one must combine the individual question scores to
obtain an overall questionnaire score to be used for classification. The procedure
we shall discuss here involves information theoretic scoring. This method is effective
for non-ordinal data, and outperforms Fisher's LDF when reversals are present.

Let $\mathbf{p}_t = (p_{t1}, \ldots, p_{tk_t})$ and $\mathbf{q}_t = (q_{t1}, \ldots, q_{tk_t})$ denote the
group 1 and group 2 response probabilities for the $k_t$ answers to question $t$. By
scoring the $i$-th answer via the information gain $\ln(p_{ti}/q_{ti})$ in favor of group 1
membership, we may transform the non-ordinal data into ordinal data. The larger the
score, the more likely is group 1 membership. (This is not true with raw scoring if the
two groups' responses are polarized with respect to each other.) For simplicity we
shall assume conditional independence of the responses given the group. This is a
common assumption in medical diagnostics (cf. Warner et al (1961), Bishop and Warner
(1969), Boyle et al (1966), Nugent et al (1964) or Reale et al (1968)), but could be
modified if obviously necessary by scoring subcollections of non-independent questions
separately and then adding together the component subcollection scores to obtain an
overall questionnaire score, or perhaps utilizing LDF for ordinal data, and our method
for the non-ordinal questions. We call our method discrimination using information
gain (DIG).
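The scoring step can be sketched as below. The function names are hypothetical and the response probabilities are illustrative values for two five-answer questions; under the conditional independence assumption the questionnaire score is simply the sum of the per-question scores.

```python
import math

def answer_scores(p_t, q_t):
    """Information-gain score ln(p_ti / q_ti) for each answer i of question t."""
    return [math.log(p / q) for p, q in zip(p_t, q_t)]

def questionnaire_score(answers, score_table):
    """Sum the scores of the chosen answers, one per question."""
    return sum(score_table[t][a] for t, a in enumerate(answers))

# Illustrative response probabilities for two questions (rows) in each group.
group_1 = [[0.35, 0.10, 0.10, 0.10, 0.35], [0.25, 0.25, 0.25, 0.10, 0.15]]
group_2 = [[0.10, 0.35, 0.10, 0.35, 0.10], [0.15, 0.10, 0.25, 0.25, 0.25]]
table = [answer_scores(p, q) for p, q in zip(group_1, group_2)]

score_a = questionnaire_score([0, 1], table)   # answers typical of group 1
score_b = questionnaire_score([1, 4], table)   # answers typical of group 2
```

On the first question the extreme answers (first and last) receive the same positive score, which is exactly the kind of non-ordinal pattern that raw 1 to 5 scoring cannot represent.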

If we are given samples of size $n_1$ and $n_2$ from group 1 and group 2, we may
estimate $p_{ti}$ and $q_{ti}$ from these training samples, and pick a number $s^*$ such
that $e(1,2) + e(2,1)$ is minimized, where $e(1,2)$ is the percentage of the $n_1$
respondents in group 1 with score $\le s^*$ and $e(2,1)$ is the percentage of the $n_2$
respondents in group 2 with score $> s^*$. We classify a respondent into group 1 if
his questionnaire score is $> s^*$ and into group 2 otherwise, so $e(i,j)$ represents
the percentage of group $i$ misclassified as belonging to group $j$.
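One simple way to pick the cutoff is an exhaustive search over the observed scores, sketched below. The function name and the training scores are hypothetical, and ties in the criterion are resolved in favor of the smallest cutoff, which is an arbitrary choice.

```python
def best_cutoff(scores_1, scores_2):
    """Return (s, e12, e21) minimising e12 + e21 over candidate cutoffs.

    A respondent is assigned to group 1 when the score is > s.
    e12: percent of group 1 with score <= s; e21: percent of group 2 with score > s.
    """
    candidates = sorted(set(scores_1) | set(scores_2))
    candidates.insert(0, candidates[0] - 1.0)   # a cutoff below every score
    best = None
    for s in candidates:
        e12 = 100 * sum(v <= s for v in scores_1) / len(scores_1)
        e21 = 100 * sum(v > s for v in scores_2) / len(scores_2)
        if best is None or e12 + e21 < best[1] + best[2]:
            best = (s, e12, e21)
    return best

# Hypothetical training-sample questionnaire scores.
group_1 = [2.1, 1.4, 0.3, -0.2, 1.9]
group_2 = [-1.7, -0.4, 0.5, -2.2, -0.9]
s_star, e12, e21 = best_cutoff(group_1, group_2)
```

Because the cutoff is tuned on the same respondents it is evaluated on, the resulting error percentages are apparent error rates and tend to be optimistic.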

Simulation studies were run to assess the power of this procedure relative to
Fisher's LDF and the discriminant procedure available in the SPSS (Statistical Package
for the Social Sciences) computer package. Also compared were the question weighting
schemes of Rao (1970), SPSS (cf. Cooley and Lohnes 1971), and the divergence weights
from §2.

The first questionnaire we simulated is presented in Table 1. Each question was
simulated independently according to the given probability structure. Note that
questions 1, 2, 3 discriminate well but in a different way than do 5 and 10. The
remaining questions discriminate moderately well except for 6, which essentially does
not discriminate the groups.
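The sampling scheme can be sketched as follows, using the response percentages of questions 2 and 6 of Table 1. The helper names, the fixed seed, and the use of Python's standard random number generator are assumptions made for illustration; they are not the original simulation code.

```python
import random

# Response probabilities (percent) for questions 2 and 6 of Table 1.
GROUP_1 = [[35, 10, 10, 10, 35], [15, 25, 25, 25, 10]]
GROUP_2 = [[10, 35, 10, 35, 10], [10, 25, 25, 25, 15]]

def simulate(group, n, rng):
    """Draw n respondents; each question is answered independently of the others."""
    return [[rng.choices(range(5), weights=w)[0] for w in group] for _ in range(n)]

def proportions(sample, question, k=5):
    """Empirical proportion of respondents choosing each of the k answers."""
    counts = [0] * k
    for respondent in sample:
        counts[respondent[question]] += 1
    return [c / len(sample) for c in counts]

rng = random.Random(0)                     # fixed seed, chosen arbitrarily
sample_1 = simulate(GROUP_1, 100, rng)     # 100 simulated members of group 1
sample_2 = simulate(GROUP_2, 100, rng)     # 100 simulated members of group 2
p_hat = proportions(sample_1, 0)           # estimates of p for question 2
q_hat = proportions(sample_2, 0)           # estimates of q for question 2
```

The empirical proportions obtained this way play the role of the training-sample estimates from which the scores and the cutoff are derived.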


Table 1: Probability structure of the simulated questionnaire I
(probabilities of response in percent)

Question    Group 1, response to part      Group 2, response to part
  No.         I   II  III   IV    V          I   II  III   IV    V

   1         25   10   30   10   25         10   35   10   35   10
   2         35   10   10   10   35         10   35   10   35   10
   3         35   10   10   10   35         10   25   30   25   10
   4         25   10   30   10   25         10   25   30   25   10
   5         25   25   25   10   15         15   10   25   25   25
   6         15   25   25   25   10         10   25   25   25   15
   7         25   25   30   10   10         10   10   30   25   25
   8         10   35   35   10   10         10   10   35   35   10
   9         35   10   35   10   10         10   10   35   10   35
  10         35   35   10   10   10         10   10   10   35   35

Table 2 shows the errors of misclassification for a simulation run with 100
members in each group.
Table 2: e(i,j). Error of misclassification in % of group i into group j
for SPSS, Fisher, DIG: 100 samples from each group.

Error (%)     Fisher    SPSS    DIG

e(1,2)          26       21      1
e(2,1)          18       17      3

As predicted, DIG does much better for this type of questionnaire since DIG does not
depend upon the centroid separation of the two groups, but rather the "information
gain" available from the questionnaire. This superiority of DIG also holds for
the determination of significant questions. As shown in Table 3, the Rao and SPSS
methods distort the relative discriminatory power of the questions while the DIG
method correctly orders them according to the information present in the question
for discrimination between the groups. The worth of a question which discriminates
between the groups in a non-ordinal way is assessed by DIG but not the other two.
This is because the DIG procedure measures the "distance" between the two probability
measures corresponding to the two groups, and not the Euclidean distance between
two real numbers (centroids) which ostensibly represent the groups.


Table 3: Question Weights for Three Methods.

[Rows: method; columns: questions 1 to 10. The table entries are not recoverable from the source.]


Table 4: Probability structure of the simulated questionnaire II
(probabilities of response in percent)

Question    Group 1, response to part      Group 2, response to part
  No.         I   II  III   IV    V          I   II  III   IV    V

   1         25   10   30   10   25         10   35   10   35   10
   2         35   10   10   10   35         10   35   10   35   10
   3         35   10   10   10   35         10   25   30   25   10
   4         25   10   30   10   25         10   25   30   25   10
   5         10   25   30   25   10         10   10   60   10   10
   6         10   35   10   35   10         10   10   60   10   10
   7         20   20   20   20   20         10   10   60   10   10
   8         35   10   10   10   35         10   10   60   10   10
   9         35   10   10   10   35         10   20   40   20   10
  10         35   10   10   10   35         10   30   20   30   10

Table 5 shows the values of e(i,j) and Table 6 shows the values of the
question "weights". Note that the SPSS and Fisher methods distort the classification
and the determination of important discriminatory questions. This is because the
centroid methods cannot distinguish the groups when they have this symmetric response
pattern. Nevertheless the two probability distributions are some "distance" apart.
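The failure of centroid methods on symmetric patterns can be seen directly from question 2 of Table 4. The sketch below assumes raw scoring of parts I to V as 1 to 5, which is an illustrative choice; under it the two group means coincide while the divergence is clearly positive.

```python
import math

SCORES = [1, 2, 3, 4, 5]                  # raw scores for parts I-V (an assumption)
p = [0.35, 0.10, 0.10, 0.10, 0.35]        # question 2 of Table 4, group 1
q = [0.10, 0.35, 0.10, 0.35, 0.10]        # question 2 of Table 4, group 2

# Group centroids under raw scoring.
mean_1 = sum(s * a for s, a in zip(SCORES, p))
mean_2 = sum(s * b for s, b in zip(SCORES, q))

# Divergence J between the two answer distributions.
j = sum((a - b) * math.log(a / b) for a, b in zip(p, q))
```

Both centroids equal 3, so a method based on mean separation sees no difference between the groups, whereas the divergence is about 1.25.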

Table 5: e(i,j). Error of misclassification in % of group i into group j
for three methods: 100 samples in each group. Simulation II

Error (%)     Fisher    SPSS    DIG

e(1,2)          44       41      1

[The e(2,1) row is not recoverable from the source.]


Table 6: Question Weights for three Methods: Simulation II

Question    Rao (F(1,194))    SPSS (F(1,198))    DIG ($\chi^2(4)$)

   1            .182              .0247              44.26
   2            .134              .1643              56.08
   3           6.444             2.909               81.45
   4            .168              .4995              14.7
   5            .058              .0396              28.1
   6            .356              .0732              66.57
   7            .002              .0035              44.87
   8            .001              .0218             110.13
   9            .76               .2262              69.33
  10           2.49              1.1069              77.11

The simulations were run again using 10,000 samples in each group. For DIG,
e(1,2) = 5%, e(2,1) = 4%; for SPSS, e(i,j) = 49%. The result again indicates that
SPSS will be less reliable than DIG. The DIG weights are also more reliable
indicators of question discriminatory power.

§6. Applications

We  shall  present  several  examples  in  which  the  information  theoretic  methods 
of  analysis  were  used  on  actual  data. 

We consider first a psychiatric screening questionnaire developed by Dr. H.
Davidian, Head of the Department of Psychiatry, University of Tehran, Iran. In
this questionnaire the permitted responses to questions such as "Do you feel restless?"
are "never", "occasionally", "frequently", "always". The questionnaire was designed
so that each question measures the degree of some aspect of the respondent's "mental
stress"; the question responses are all ordered by degree in the same direction;
"mentally ill" patients are presumed to respond at the "high" end, "normals" at the
"low" end. Since this is so, the two groups' score centroids should be well separated
and consequently, both SPSS and DIG should discriminate well.

The questionnaire consisted of 46 questions and was designed for the purpose of
classifying each respondent as "mentally ill" or "normal". It was given to 143
respondents, 90 of whom were classified prior to the administration of the questionnaire
as "mentally ill" while the rest were "normal". The values assigned to the responses
were 0 for "never", 1 for "occasionally", 2 for "frequently" and 3 for "always".
SPSS was run using this raw scoring technique. Essentially SPSS and DIG behave the
same for this nice ordinal data. The question weights developed by SPSS and DIG, as
expected, gave essentially the same assessment of question worth. The question weights
were used to shorten the questionnaire as outlined in §4. The twenty-two questions with
the highest weights were selected. (Psychiatric technical considerations also played
a part in the choice of these questions.) Using this new "reduced" questionnaire it
was found that e(1,2) = e(2,1) = 0 for the DIG analysis, while for the SPSS analysis
e(1,2) = 0 and e(2,1) = 4%. This questionnaire has been used for screening purposes
in Iran. The use of the reduced questionnaire has resulted in a considerable saving
in time over the original questionnaire. (Note: in all of these simulations the
apparent error rate is used, and hence is optimistically biased. Still the results
are encouraging.)

The second set of data upon which these methods have been used involved a survey
conducted by the Pan American Health Organization (PAHO) on child mortality in
1969-1970 in South American countries. Among the questionnaires was one covering the
socio-economic status of a household and various environmental factors. Some questions
were "What type of water supply do you have?" with such permitted responses as "piped
water", "well", "rain water"; and "What is the marital status of the mother?" with
responses such as "married", "divorced", "separated". We note here that many of these
questions had answers which were not essentially ordinal and hence reversals may occur,
and a linear discriminant function may be inappropriate. (Because of a contractual
agreement between PAHO and the World Health Organization, Geneva, where these data were
analyzed by one of us (A.L.), we are not able to disclose details of this questionnaire
or of the analyses.) We analyzed twelve questions from this questionnaire. Group 1
consisted of all those households where a child under 5 died of malnutrition or
diarrhea. The second group consisted of households where a child died from other
causes. In all there were 952 households; however, complete information was available
on only 154 in group 1 and 37 in group 2. We chose only those questionnaires without
missing responses since SPSS needs a special program for missing data (DIG does not)
and we wanted a direct comparison between the two methods. The proportion of answers
of each group to each response is given in Table 7. From Table 8 we find that for DIG,
e(1,2) = 12, e(2,1) = 24 and for SPSS e(1,2) = 26, e(2,1) = 25. It should be noted that
the results of SPSS and DIG agree upon which of the questions are significantly
discriminating except for question 9, which DIG found to be highly discriminating and
SPSS found to be not discriminating. The importance of question 9 was lost to SPSS
since it discriminated in a non-ordinal manner. Consequently, if one uses SPSS on data
which is not essentially ordinal, one may unintentionally eliminate significant
variables.
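
A small numerical sketch may help show how a question can discriminate in a non-ordinal manner. The response probabilities below are hypothetical (they are not the question 9 data), and the symmetric divergence J(p, q) = sum over i of (p_i - q_i) ln(p_i / q_i) is used here as an assumed stand-in for an information theoretic distance; it is not claimed to be the exact weight computed by DIG. Group 1 favors the two extreme responses and group 2 the middle one, so the mean score under the coding 1, 2, 3 is the same in both groups while the divergence is large.

```python
import math


def j_divergence(p, q, eps=1e-9):
    """Symmetric divergence J(p, q) = sum_i (p_i - q_i) * ln(p_i / q_i).

    eps is an illustrative guard against zero probabilities (an assumption).
    """
    total = 0.0
    for pi, qi in zip(p, q):
        pi = max(pi, eps)
        qi = max(qi, eps)
        total += (pi - qi) * math.log(pi / qi)
    return total


def ordinal_mean(p):
    """Mean response when the categories are scored 1, 2, 3, ..."""
    return sum((k + 1) * pk for k, pk in enumerate(p))


# Hypothetical response probabilities for one three-part question.
group1 = [0.45, 0.10, 0.45]  # extremes favored
group2 = [0.10, 0.80, 0.10]  # middle response favored

mean_gap = abs(ordinal_mean(group1) - ordinal_mean(group2))
divergence = j_divergence(group1, group2)

print(f"difference in mean score: {mean_gap:.3f}")
print(f"symmetric divergence J:   {divergence:.3f}")
```

A method that relies on the difference in mean scores sees no separation between these two groups, whereas a distance computed on the full response distribution does.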


Table 7: Simulation of response probabilities for Pan American Health
Organization survey

               Probability of response        Probability of response
               of group 1 to part             of group 2 to part
Question No.   I    II   III  IV   V          I    II   III  IV   V


17, 37, 46, 0; 76, 22, 2, 0; 1, 69, 25, 6, 0; 93, 7, 0; 19, 16, 65, 0; 39, 15, 23, 23;
29, 66; 55; 50; 5; 47; 45; 8; 24, 15; 92, 8; 32, 17; 10, 24; 32, 18; 11, 34; 99, 1;
16, 28; 66; 16; 19; 34; 18, 16; 13, 24

Table 8: Error (%) of Misclassification: Simulation III,
100 samples in each group.

            Method
Error       Fisher    SPSS    DIG
e(1,2)      31        26      12
e(2,1)      32        25      24

Table 9: Question Weights: Simulation III

          Question
Method       1     2     3     4     5     6     7     8     9    10    11    12
Rao       12.6  4.48     0   4.9  1.89   .48  .128  1.39  .808  2.41  11.5   .09
SPSS      24.9  10.6  .966  7.96  2.82  .929   .07  4.78  .106  3.95  27.7  2.82
DIG       34.3  30.8  4.85  8.45  4.58  4.45  8.11  5.52  11.2  3.95  28.1  2.82


As a final example of how this method has been used we briefly sketch the
following: the acceptability group in the human reproduction division of the
World Health Organization has used the DIG technique to assess the feasibility and
acceptability of various modes of contraception. In particular they used this
method to determine for which groups of people a paper birth control pill is
acceptable. Various factors affect the acceptability of the paper pill. For
example, the persons involved may refuse to eat paper, the climate may be such
that the paper cannot be kept dry, or the life style may be such that the paper
sheets cannot be kept clean. On the other hand, if the paper pill is acceptable
in a particular region, it reduces costs and is easier to store and administer.
A categorical questionnaire was designed to ascertain which groups would accept the
paper pill and was given to samples in Alexandria and Cairo (Egypt), Cariche and
Ibaden (India), Manila (Philippines), Stockholm (Sweden) and Bangkok (Thailand).
It was desired to find which questions or factors discriminated between those
respondents who accepted the paper pill and those who did not. Also it was desirable
to know how effective each question was in distinguishing between the groups.

Both SPSS and DIG analyses were performed on the data using sample sizes of
about 200 in each country. Both produced the same set of discriminating questions,
and approximately the same error rates (20-30%). We cannot present a detailed
description of the data collected since this study is still ongoing, and the WHO has
priority on the publication of the exact data. Nevertheless, this example and the
previous examples show how information theory has been successfully applied to real
data. For further information on the contraceptive study, contact Dr. Cri Kars, Human
Reproduction Division, WHO, Geneva, Switzerland.

There  is  also  a user's  manual  being  produced  at  the  WHO  by  Busca  and  Diethelm 
which  contains  a computer  package  to  implement  the  DIG  analysis. 


References 

1.  Bishop, C.R. and H.R. Warner (1969): "A Mathematical Approach to Medical
    Diagnosis: Application to Polycythemic States Utilizing Clinical Findings
    with Values Continuously Distributed", Computers and Biomedical Research
    2, 486-493.

2.  Boyle, J.A., W.R. Grieg, D.A. Franklin, R.M. Harden, W.W. Buchanan and E.M.
    McGirr (1966): "Construction of a Model for Computer-assisted Diagnosis:
    Application to the Problem of Non-toxic Goitre", Quarterly Journal of
    Medicine, N.S. 35, 565-588.

3.  Brockett, P.L., A. Charnes and W.W. Cooper (1979): "MDI estimation via
    unconstrained convex programming", Center for Cybernetic Studies Report
    CCS 326, The University of Texas at Austin.

4.  Brockett, P.L., P. Haaland and A. Levine (1977a): "Discriminant Analysis
    for Categorical Questionnaire Data", Tulane University mathematics
    department preprint.

5.  Brockett, P.L., P. Haaland and A. Levine (1977b): "A characterization of
    divergence with applications to Questionnaire Information", to appear,
    Information and Control.




6.  Buses, B. and P. Diethelm (1978): "Discriminant Analysis using Information
    Gain (DIG)", manual for WHO, in preparation.

7.  Cooley, W.W. and P.R. Lohnes (1971): Multivariate Data Analysis. John Wiley
    and Sons, New York, Chapter 9.

8.  Fisher, R.A. (1936): "The use of multiple measurements in taxonomic problems",
    Ann. Eugen., 7, 179.

9.  Gokhale, D.V. and S. Kullback (1978): The Information in Contingency Tables.
    New York, Marcel Dekker, Inc.

10. Goldstein, M. and W.R. Dillon (1977): "A stepwise discrete variable selection
    procedure", Comm. Stat., Theory and Methods 6, 1423-1436.

11. Goldstein, M. and W.R. Dillon (1978): Discrete Discriminant Analysis. New York,
    John Wiley and Sons.

12. Hills, M. (1966): "Allocation rules and their error rates", J. Roy. Stat. Soc.
    B 28, 1.

13. Kalton, G., M. Collins and L. Brook (1978): "Experiments in Wording Opinion
    Questions", Applied Stat., 27, No. 2, 149-161.

14. Kullback, S. (1959): Information Theory and Statistics, New York, John Wiley
    and Sons, Dover Press (1968), New York.

15. Lachenbruch, P.A. (1975): Discriminant Analysis. Hafner Press, New York.

16. Levine, A. (1974): "A new approach to Discriminant Analysis in Screening
    Questionnaires", Int. Symp. on Epidemiological Studies in Psychiatry, Tehran.

17. Moore, D.H. II (1973): "Evaluation of five discrimination procedures for
    binary variables", J. Amer. Stat. Assoc., 68, 339-404.

18. Nugent, C.A., H.R. Warner, J.T. Dunn and F.H. Tyler (1964): "Probability
    Theory in the Diagnosis of Cushing's Syndrome", The Journal of Clinical
    Endocrinology 24, 621-627.

19. Oppenheim, A.N. (1966): Questionnaire Design and Attitude Measurement. New
    York, Basic Books, Inc.

20. Payne, S.L. (1951): The Art of Asking Questions. Princeton University Press.

21. Reale, A., G.A. Maccacaro, E. Rocca, S. D'Intino, P.A. Geoffre, A. Vestri and
    M. Motolese (1968): "Computer Diagnosis of Congenital Heart Disease",
    Computers and Biomedical Research 1, 533-549.

22. Rao, C.R. (1970): "Inference in discriminant function coefficients", Essays
    in Probability and Statistics, R.C. Bose et al., eds., Chapel Hill,
    University of North Carolina and Statistical Publishing Society, pg. 487-602.

23. Warner, H.R., A.F. Toronto, L.G. Veasey and R. Stephenson (1961): "A
    Mathematical Approach to Medical Diagnosis: Application to Congenital Heart
    Disease", Journal of the American Medical Association 177, 177-183.


Unclassified

DOCUMENT CONTROL DATA - R & D

Originating activity (corporate author): Center for Cybernetic Studies,
    The University of Texas at Austin
Report security classification: Unclassified
Report title: Information Theoretic Analysis of Questionnaire Data
Author(s): P. Brockett, P. Haaland, A. Levine
Report date: March 1979
Contract or grant no.: N00014-75-C-0569 & 0616
Project no.: NR047-021
Originator's report number(s): Center for Cybernetic Studies Research Report CCS 336
Distribution statement: This document has been approved for public release and sale;
    its distribution is unlimited.
Sponsoring military activity: Office of Naval Research (Code 434), Washington, DC

